Future world cancer death rate prediction

Cancer is a worldwide illness that causes significant morbidity and death and imposes an immense cost on global public health. Modelling such a phenomenon is challenging because of the non-stationarity and complexity of cancer waves. The objective is to apply modern novel statistical methods directly to raw clinical data in order to estimate the likelihood of an extreme cancer death rate at any period in any location of interest. Setting: a multicentre, population-based biostatistical approach built on medical survey data. Traditional statistical methodologies that deal with temporal observations of multi-regional processes cannot adequately handle substantial regional dimensionality and cross-correlation between the various regional variables. This paper offers a novel bio-system reliability technique suited to multi-regional environmental and health systems. When a system is monitored over a sufficiently long period, the technique yields a reliable long-term projection of the chance of an exceptional cancer mortality rate. The approach may be employed in numerous public health applications, depending on the clinical survey data available.

The National Cancer Institute defines cancer as a group of disorders in which aberrant cells may proliferate and invade neighbouring tissue. Cancer may develop in most regions of the body, resulting in various cancer forms, as indicated below, and can sometimes spread via the blood and lymph systems.
Cancer's statistical characteristics have received much attention from the scientific community [1][2][3][4][5][6][7][8]. Using current theoretical statistical methods [9][10][11][12][13][14][15], it is often rather challenging to compute realistic biological system reliability factors and outbreak probabilities under actual cancer settings. Typically, this results from the many degrees of freedom and random variables that drive widely dispersed dynamic biological systems. In theory, the reliability of a complex biological system may be precisely evaluated using sufficient observations or direct Monte Carlo simulations. However, the available cancer observations begin in 1990 and are in part limited in number [16][17][18][19][20][21]. Motivated by this limitation, the authors have developed a novel reliability technique for biological and health systems to forecast and control cancer epidemics more precisely. The whole globe was selected because of the extensive online health observations and associated research 1.
In the health and engineering fields, statistical modelling of lifetime data and extreme value theory (EVT) are widespread. For example, Gumbel utilised EVT to predict the demography of distinct communities in [20][21][22][23]. Arguments for and against an upper bound on the distribution of life expectancy were recently presented in 24. Papers in these fields often presume a parametric bivariate lifetime distribution obtained from the exponential distribution to get statistically relevant data 24. In 25, the author proposes a new approach that uses Power Variance Function copulas (e.g., Clayton, Gumbel and Inverse Gaussian copulas), conditional sampling, and the numerical approximation used in survival analysis. In 26, the authors explain that EVT has been used to predict mutation in evolutionary genetics and further develop a likelihood framework from EVT to determine the fitness effects of mutations.
Similarly, in 27, the author applies a Beta-Burr distribution to this EVT hypothesis to calculate the fitness impact. In 28, the author presents a bivariate logistic regression model, which was afterwards used to assess multiple MS fatalities with walking difficulties and in a cognitive experiment for visual identification. Finally, 3 is a relevant work utilising EVT to evaluate the chance of a global cancer outbreak. In 22,23, researchers similarly employed EVT to predict and identify cancer abnormalities.
In this research, a cancer outbreak is seen as an unanticipated occurrence that may take place in any location of a nation at any moment; hence, the spatial spread is considered. Moreover, a specific non-dimensional factor is introduced to forecast the cancer risk at any given time and location. One possibility is to treat environmental impacts on biological systems as ergodic. The second possibility is to see the process as reliant on specific external characteristics whose time-dependent change may be modelled as an ergodic process on its own. The incidence data of cancer in one hundred ninety-five world countries during the years 1990-2019 were retrieved from the public website 1 and treated as a multi-degree-of-freedom (MDOF) spatio-temporal dynamic bio-system with highly inter-correlated regional components/dimensions. This research tries to reduce the danger of future cancer outbreaks by forecasting them. However, it focuses solely on the yearly number of documented patient deaths and not on the symptoms themselves. Figure 1 presents the map of the world's countries.
Further research should incorporate one of the common complexity measures, such as fractal, attractor/embedding dimension, and entropy.

Methods
Consider an MDOF (multi-degree-of-freedom) structure subjected to random ergodic environmental factors (stationary in time). Alternatively, the process may be viewed as dependent on certain external characteristics whose time-dependent change may be modelled as an ergodic process on its own. The MDOF biomedical response vector process $R(t) = (X(t), Y(t), Z(t), \ldots)$ is measured and/or simulated over a sufficiently long time interval $(0, T)$. The unidimensional global maxima over the time interval $(0, T)$ are denoted $X_{\max}, Y_{\max}, Z_{\max}, \ldots$. By a sufficiently long time $T$ one primarily means a value of $T$ that is large with respect to the dynamic system auto-correlation time [33][34][35][36][37][38][39][40].
Let $X_1, \ldots, X_{N_X}$ be the time-consecutive local maxima of the process $X(t)$ at monotonically increasing discrete time instants $t_1^X < \cdots < t_{N_X}^X$ in $(0, T)$. The analogous definition follows for the other MDOF response components $Y(t), Z(t), \ldots$ with $Y_1, \ldots, Y_{N_Y}$; $Z_1, \ldots, Z_{N_Z}$ and so on. For simplicity, all components of $R(t)$, and therefore their maxima, are assumed to be non-negative. The aim is to estimate the system failure probability
$$1 - P = \mathrm{Prob}\left(X_{\max} > \eta_X \cup Y_{\max} > \eta_Y \cup Z_{\max} > \eta_Z \cup \ldots\right) \tag{1}$$
with
$$P = \iiint_{(0,\,0,\,0,\,\ldots)}^{(\eta_X,\,\eta_Y,\,\eta_Z,\,\ldots)} p_{X_{\max}, Y_{\max}, Z_{\max}, \ldots}(x, y, z, \ldots)\, dx\, dy\, dz \ldots$$
being the probability of non-exceedance of the critical values $\eta_X, \eta_Y, \eta_Z, \ldots$ of the response components; $\cup$ denotes the logical union operation; and $p_{X_{\max}, Y_{\max}, Z_{\max}, \ldots}$ is the joint probability density of the global maxima over the entire time span $(0, T)$.
In practice, it is not possible to accurately estimate the latter joint probability distribution $p_{X_{\max}, Y_{\max}, Z_{\max}, \ldots}$ because of its high dimensionality and the limitations of the available data set. In other words, the system is regarded as immediately failed at the time instant when either $X(t)$ exceeds $\eta_X$, or $Y(t)$ exceeds $\eta_Y$, or $Z(t)$ exceeds $\eta_Z$, and so on. The fixed failure levels $\eta_X, \eta_Y, \eta_Z, \ldots$ are of course individual for each unidimensional response component of $R(t)$. In this case, $t_j$ represents the time instant of a local maximum of one of the MDOF bio-system response components, either $X(t)$ or $Y(t)$, or $Z(t)$ and so on. That means that, having the $R(t)$ time record, one just has to continually and concurrently screen for local maxima of the unidimensional response components and record any exceedance of the MDOF limit vector $(\eta_X, \eta_Y, \eta_Z, \ldots)$ in any of its components $X, Y, Z, \ldots$. The local maxima of the unidimensional response components are merged into a single vector $\vec{R} = (R_1, R_2, \ldots, R_N)$, ordered in accordance with the non-decreasing merged time vector $t_1 \le \cdots \le t_N$. That is to say, each $R_j$ is the actually encountered local maximum of either $X(t)$ or $Y(t)$, or $Z(t)$ and so on. Finally, the unified limit vector $(\eta_1, \ldots, \eta_N)$ is introduced, with each component $\eta_j$ being either $\eta_X$, $\eta_Y$ or $\eta_Z$ and so on, depending on which of $X(t)$, $Y(t)$, $Z(t)$, etc. the current local maximum with running index $j$ belongs to.
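The screening-and-merging step described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `local_maxima` and `merge_maxima`, the toy series, the failure levels in `limits`, and the choice to divide each maximum by its own failure level (so that every limit becomes 1) are all assumptions made for the example.

```python
import numpy as np

def local_maxima(t, x):
    """Return times and values of interior local maxima of the series x(t)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    # Interior point i is kept if it is larger than its left neighbour
    # and not smaller than its right neighbour.
    idx = np.where((x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]))[0] + 1
    return t[idx], x[idx]

def merge_maxima(t, components, limits):
    """Merge the local maxima of all components into one time-ordered vector.

    components: dict name -> series sampled at the common times t
    limits:     dict name -> failure level eta of that component (assumed known)
    """
    times, values, labels = [], [], []
    for name, series in components.items():
        tm, xm = local_maxima(t, series)
        times.append(tm)
        values.append(xm / limits[name])  # assumption: scale so each limit equals 1
        labels.extend([name] * len(tm))
    times = np.concatenate(times)
    values = np.concatenate(values)
    order = np.argsort(times, kind="stable")  # non-decreasing merged time vector
    return times[order], values[order], [labels[i] for i in order]

# Toy two-component example (hypothetical numbers, not cancer data).
t = np.arange(6)
components = {"X": [1, 3, 2, 4, 1, 0], "Y": [0, 2, 5, 1, 3, 2]}
limits = {"X": 4.0, "Y": 10.0}
t_merged, R, who = merge_maxima(t, components, limits)
```

Here `R` plays the role of the merged vector of local maxima and `who` records which component each entry came from, so the matching limit for index j can always be looked up.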
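Next, a scaling parameter $0 < \lambda \le 1$ is introduced to artificially lower the limit values for all response components concurrently, namely the new MDOF limit vector $(\eta_X^{\lambda}, \eta_Y^{\lambda}, \eta_Z^{\lambda}, \ldots)$ with $\eta_X^{\lambda} \equiv \lambda \cdot \eta_X$, $\eta_Y^{\lambda} \equiv \lambda \cdot \eta_Y$, $\eta_Z^{\lambda} \equiv \lambda \cdot \eta_Z$ and so on; correspondingly, the unified limit vector has components $\eta_j^{\lambda} \equiv \lambda \cdot \eta_j$. The latter automatically defines the probability $P(\lambda)$ as a function of $\lambda$; note that $P \equiv P(1)$ from Eq. (1). The non-exceedance probability $P(\lambda)$ can now be estimated as follows:
$$P(\lambda) = \mathrm{Prob}\{R_N \le \eta_N^{\lambda}, \ldots, R_1 \le \eta_1^{\lambda}\} = \prod_{j=2}^{N} \mathrm{Prob}\{R_j \le \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}, \ldots, R_1 \le \eta_1^{\lambda}\} \cdot \mathrm{Prob}(R_1 \le \eta_1^{\lambda}).$$
In practice, the dependence between neighbouring $R_j$ is not always negligible. Taking full statistical independence as conditioning level $k = 1$, the following one-step memory approximation is therefore introduced:
$$\mathrm{Prob}\{R_j \le \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}, \ldots, R_1 \le \eta_1^{\lambda}\} \approx \mathrm{Prob}\{R_j \le \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}\} \tag{4}$$
for $2 \le j \le N$ (called here conditioning level $k = 2$). The approximation introduced by Eq. (4) can be further refined as
$$\mathrm{Prob}\{R_j \le \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}, \ldots, R_1 \le \eta_1^{\lambda}\} \approx \mathrm{Prob}\{R_j \le \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}, R_{j-2} \le \eta_{j-2}^{\lambda}\} \tag{5}$$
where $3 \le j \le N$ (called here conditioning level $k = 3$), and so on. The goal is to monitor each isolated failure that occurs locally first in time, thereby avoiding cascades of locally inter-correlated exceedances.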
Equation (5) presents subsequent refinements of the statistical independence assumption. This type of approximation captures the statistical dependence between neighbouring maxima with increasing accuracy. Since the original MDOF bio-process $R(t)$ was assumed ergodic and therefore stationary, the probability $p_k(\lambda) := \mathrm{Prob}\{R_j > \eta_j^{\lambda} \mid R_{j-1} \le \eta_{j-1}^{\lambda}, \ldots, R_{j-k+1} \le \eta_{j-k+1}^{\lambda}\}$ for $j \ge k$ is independent of $j$ and depends only on the conditioning level $k$. Thus the non-exceedance probability can be approximated as in the Naess-Gaidai method 29,30:
$$P_k(\lambda) \approx \exp\left(-N \cdot p_k(\lambda)\right), \quad k \ge 1. \tag{6}$$
Note that Eq. (6) follows from Eq. (1) by neglecting $\mathrm{Prob}(R_1 \le \eta_1^{\lambda}) \approx 1$, as the design failure probability is usually very small. Further, it is assumed that $N \gg k$. Note that Eq. (6) is similar to the well-known mean up-crossing rate equation for the probability of exceedance 32. There is obvious convergence with respect to the conditioning parameter $k$:
$$P = \lim_{k \to \infty} P_k(1); \qquad p(\lambda) = \lim_{k \to \infty} p_k(\lambda). \tag{7}$$
Note that Eq. (6) for $k = 1$ turns into the well-known non-exceedance probability relationship with the mean up-crossing rate function
$$P(\lambda) \approx \exp\left(-\nu^{+}(\lambda)\, T\right) \tag{8}$$
where $\nu^{+}(\lambda)$ is the mean up-crossing rate of the response level $\lambda$ for the non-dimensional vector $\vec{R}$ assembled above. Note that the constructed $\vec{R}$-vector has no data loss at all; see Fig. 2.
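To make the conditioning idea concrete, the sketch below estimates $p_k(\lambda)$ as a plain conditional frequency and then evaluates the exponential approximation of the non-exceedance probability. It is an illustrative assumption rather than the authors' estimator: the function names, the toy vector `R`, and the premise that `R` has already been scaled so that every failure level equals 1 are all hypothetical.

```python
import numpy as np

def p_k(R, lam, k):
    """Empirical p_k(lam): among indices j whose previous k-1 values are all
    <= lam, the fraction for which R_j itself exceeds lam.
    Assumes R is already scaled so that every failure level equals 1."""
    R = np.asarray(R, dtype=float)
    exceed = R > lam
    hits = 0
    trials = 0
    for j in range(k - 1, len(R)):
        # Conditioning event: no exceedance in the k-1 preceding maxima.
        # For k = 1 the slice is empty, so every index is counted.
        if not exceed[j - k + 1:j].any():
            trials += 1
            hits += int(exceed[j])
    return hits / trials if trials else float("nan")

def non_exceedance(R, lam, k):
    """Approximate P_k(lam) by exp(-N * p_k(lam)), with N = len(R)."""
    return float(np.exp(-len(R) * p_k(R, lam, k)))

# Toy merged vector of scaled local maxima (hypothetical numbers).
R = [0.75, 0.5, 1.0, 0.3, 0.9, 0.2]
p1 = p_k(R, 0.8, 1)   # independence assumption
p2 = p_k(R, 0.8, 2)   # one-step memory
P1 = non_exceedance(R, 0.8, 1)
```

With only six points the exponential approximation is not meaningful in itself, since it presumes a long record and small exceedance probabilities; the example only shows how the conditioning level changes which indices are counted.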
In the preceding, the assumption of stationarity has been employed. The proposed methodology can also treat the non-stationary case, as the following illustration shows. Consider a scatter diagram of $m = 1, \ldots, M$ environmental states, each short-term bio-environmental state having a probability $q_m$, so that $\sum_{m=1}^{M} q_m = 1$. The corresponding long-term equation is then
$$p_k(\lambda) \equiv \sum_{m=1}^{M} p_k(\lambda, m)\, q_m \tag{9}$$
with $p_k(\lambda, m)$ being the same function as in Eq. (7) but corresponding to a specific short-term environmental state with the number $m$. The functions $p_k(\lambda)$ introduced above are often regular in the tail, specifically for values of $\lambda$ approaching and exceeding 1. More precisely, for $\lambda \ge \lambda_0$, the distribution tail behaves similarly to $\exp\{-(a\lambda + b)^{c} + d\}$ with $a, b, c, d$ being suitably fitted constants for a suitable tail cut-on value $\lambda_0$. Therefore, one can write
$$p_k(\lambda) \approx \exp\{-(a_k \lambda + b_k)^{c_k} + d_k\}, \quad \lambda \ge \lambda_0. \tag{10}$$
Next, by plotting $\ln\{\ln(p_k(\lambda)) - d_k\}$ versus $\ln(a_k \lambda + b_k)$, nearly perfectly linear tail behaviour is often observed. Optimal values of the parameters $a_k, b_k, c_k, p_k, q_k$ may also be determined using a sequential quadratic programming (SQP) method incorporated in the NAG Numerical Library 31.
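The tail model above can be fitted and extrapolated with a few lines of NumPy. The sketch below is an assumption-laden illustration, not the NAG-based SQP fit named in the text: it rewrites $(a\lambda + b)^{c}$ as $A(\lambda + \beta)^{c}$ with $A = a^{c}$ and $\beta = b/a$, scans a user-supplied grid of $(\beta, c)$, and solves a linear least-squares problem for $(d, A)$ at each grid point. The function names, the grids and the synthetic data are hypothetical.

```python
import numpy as np

def fit_tail(lam, p, betas, cs):
    """Fit ln p = d - A * (lam + beta)**c on the tail lam >= lam_0.

    Grid search over (beta, c); linear least squares for (d, A) at each point.
    This swaps a simple grid search in for the SQP optimiser used in the paper.
    """
    lam = np.asarray(lam, dtype=float)
    y = np.log(np.asarray(p, dtype=float))
    best = None
    for beta in betas:
        if np.any(lam + beta <= 0):
            continue  # power of a non-positive base is not meaningful here
        for c in cs:
            basis = (lam + beta) ** c
            M = np.column_stack([np.ones_like(lam), -basis])
            coef, *_ = np.linalg.lstsq(M, y, rcond=None)
            d, A = coef
            sse = float(np.sum((M @ coef - y) ** 2))
            if A > 0 and (best is None or sse < best[0]):
                best = (sse, d, A, beta, c)
    sse, d, A, beta, c = best
    return {"d": float(d), "A": float(A), "beta": float(beta), "c": float(c), "sse": sse}

def return_level(fit, p_target):
    """Invert the fitted tail: the lam at which p(lam) equals p_target."""
    return ((fit["d"] - np.log(p_target)) / fit["A"]) ** (1.0 / fit["c"]) - fit["beta"]

# Synthetic tail generated from a = 3, b = 0.3, c = 2, d = 0 (hypothetical values).
lam = np.linspace(0.2, 0.6, 9)
p = np.exp(-(3.0 * lam + 0.3) ** 2)
fit = fit_tail(lam, p, betas=np.linspace(0.0, 0.3, 7), cs=np.linspace(1.0, 3.0, 9))
lam_star = return_level(fit, 1e-6)  # extrapolated level for a target probability
```

Because $\ln p_k(\lambda) - d_k$ is negative under this model, a linearity check in the spirit of the plot described above would use its absolute value, i.e. $\ln|\ln p_k(\lambda) - d_k| = c_k \ln(a_k \lambda + b_k)$. On real data the grid would need to be refined, or replaced by a proper constrained optimiser.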

Results
Predictions of cancer-related mortality have long been a focus of epidemiology and mathematical biology. It is well known that public health dynamics form a highly non-linear, multidimensional, spatially cross-correlated dynamic system that is always difficult to analyse. Previous studies have used a variety of approaches to model cancer cases. This section applies the methodology described above to real-life cancer data sets, presented as annual recorded time series for all world countries. The statistical information presented in this section was obtained from the Our World in Data website 1, which provides cancer death rates per country from 1990 to 2019. Patient death numbers from one hundred ninety-five different world countries were chosen as the components X, Y, Z, ..., thus constituting an example of a one hundred ninety-five dimensional (195D) dynamic biological system. To unify all 195 measured time series X, Y, Z, ..., they were scaled to a common non-dimensional form, with the whole vector $\vec{R}$ being sorted according to non-decreasing times of occurrence of the local maxima. Figure 3 presents the number of new annual recorded deaths as a 195D vector $\vec{R}$, consisting of the assembled regional new annual death rates for each corresponding country. Data for Greenland, Mongolia, Monaco and Hungary were excluded from the analysis, since they were regarded as outliers. Note that the vector $\vec{R}$ is assembled from different regional components with different cancer backgrounds. The index j is just a running index of the local maxima encountered in a non-decreasing time sequence. Figure 4 presents the prediction of the annual death rate (deaths from cancer as a percentage of the population of a given country): the extrapolation according to Eq. (10) towards a cancer outbreak with a 100-year return period, indicated by the horizontal dotted line, and somewhat beyond. A cut-on value of $\lambda_0 = 0.18\%$ was used, with the percentage of the local population on the horizontal axis. The dotted lines indicate the extrapolated 95% confidence interval according to Eq. (11). According to Eq. (5), $p(\lambda)$ is directly related to the target failure probability $1 - P$ from Eq. (1). Therefore, in agreement with Eq. (5), the system failure probability $1 - P \approx 1 - P_k(1)$ can be estimated. Note that in Eq. (6), N corresponds to the total number of local maxima in the unified response vector $\vec{R}$. The conditioning parameter k = 3 was found to be sufficient owing to convergence with respect to k, see Eq. (6). Figure 4 exhibits a reasonably narrow 95% CI, which is an advantage of the proposed method.
The predicted cancer death rate in any world country in any year to come for the next 100 years was found to be about 0.24%.
Note that, although novel, the technique described above has the distinct benefit of using existing measured data sets very effectively, owing to its capacity to deal with the multidimensionality of the health system and to perform accurate extrapolation from relatively small data sets. Note that the predicted non-dimensional λ level, indicated by the star in Fig. 4, represents the probability of a cancer outbreak in any world country in the years to come.
In order to validate the suggested methodology, a data set half the size was used to obtain predictions for the same probability levels of interest as in Fig. 4. The reduced data set was obtained from the original data set by sampling every second consecutive data point. The predicted λ level based on the reduced data set was found to lie within the 95% CI based on the entire data set, indicated in Fig. 4.
The second-order difference plot (SODP) originated from the Poincaré plot. The SODP allows the statistical behaviour of consecutive differences in time series data to be observed. Figure 5 presents the SODP along with a third-order difference plot (TODP) and a fourth-order difference plot (FODP). These kinds of plots can be used for data pattern recognition and comparison with other data sets, for example, for the entropy artificial intelligence (AI) recognition approach 32. Note that EVT is asymptotic and 1DOF, while this study introduces MDOF and sub-asymptotic approaches. To summarise, the predicted non-dimensional λ level, indicated by the star in Fig. 4, represents the probability of world cancer deaths in the years to come. The methodology's limitation lies in its assumption of quasi-stationarity of the underlying bio-environmental process.
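For readers unfamiliar with the second-order difference plot mentioned above, a minimal Python sketch follows. It uses the commonly cited construction in which each first difference x[n+1] - x[n] is plotted against the next one, x[n+2] - x[n+1]; the function name `sodp_points` and the toy series are assumptions, and the third- and fourth-order variants are deliberately left out because the text does not define them.

```python
import numpy as np

def sodp_points(x):
    """Return the (abscissa, ordinate) point sets of a second-order
    difference plot: x[n+1] - x[n] against x[n+2] - x[n+1]."""
    d = np.diff(np.asarray(x, dtype=float))  # first differences
    return d[:-1], d[1:]

# Toy series (hypothetical numbers, not the cancer data).
x = [1, 3, 2, 5, 4]
u, v = sodp_points(x)
```

Scattering `v` against `u` (for example with a plotting library of choice) gives the plot; a tight cluster around the origin indicates small, regular year-to-year changes, while a wide spread indicates large swings.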

Discussion
Traditional health systems reliability methods dealing with observed time series do not have the advantage of dealing efficiently with systems possessing high dimensionality and cross-correlation between different system responses. The essential advantage of the introduced methodology is its ability to study the reliability of high dimensional non-linear dynamic systems.
Despite its simplicity, the present study successfully offers a novel multidimensional modelling strategy and a methodological avenue for forecasting the cancer death rate. Proper setting of health system alarm limits (failure limits) per country has been discussed.
This paper studied recorded cancer death rates from all world countries, constituting an example of a one hundred ninety-five dimensional (195D) system observed from 1990 to 2019. The novel reliability method was applied to annual cancer death rate numbers as a multidimensional system. The theoretical reasoning behind the proposed method is given in detail. Note that the direct use of either measurement or Monte Carlo simulation for dynamic biological system reliability analysis is attractive; however, dynamic system complexity and high dimensionality require the development of novel, robust and accurate techniques that can deal with the limited data set at hand, utilising the available data as efficiently as possible.
The main conclusion is that the public health system under local environmental and epidemiologic conditions is well managed. This study predicted a 100-year return period risk level for the annual death rate of about 0.24%. Therefore, under current national health management conditions, cancer still represents a future threat to world health.
This study further aimed to develop a general-purpose, robust, and straightforward multidimensional reliability method. The method introduced in this paper has been previously validated by application to a wide range of simulation models, but for only one-dimensional system responses and, in general, very accurate predictions were obtained. Both measured and numerically simulated time series responses can be analysed. It is shown that the proposed method produced a reasonable confidence interval. Thus, the suggested methodology may become appropriate for various non-linear dynamic biological systems reliability studies. Finally, the suggested methodology can be used in many public health applications. The presented cancer example does not limit areas of new method applicability (Supplementary file).
The suggested method can work well with non-stationary data sets (for example, with seasonal variations) as long as they are representative of the process of interest. If, however, there is an underlying trend in the process of interest or the data were manipulated, those effects have to be identified. In that case, trend analysis should be performed, which is a topic for future studies. In any case, the authors assume that quasi-stationarity holds within a 3-year horizon. Therefore, the limitation of this study lies in the assumption of bio-system quasi-stationarity, which is, of course, not valid for many years to come.

Data availability
The datasets analysed during the current study are available online 1 https://ourworldindata.org/causes-of-death. The authors confirm that all methods were performed following the relevant guidelines and regulations according to the Declaration of Helsinki.

Code availability
For software used to extrapolate probability tails in this study, see https://github.com/cran/acer.